---
title: "EpiClass accurately predicts EpiATLAS assay and biospecimen metadata"
author: "Joanny Raby"
resources:
- "../resources/global_chromscore.html"
format:
html:
code-fold: true
code-tools: true
toc: true
toc-location: right
toc-expand: 2
embed-resources: true
engine: jupyter
execute:
echo: true
warning: false
eval: true
error: false
---
# Results section 2: Figures and their code
The formatting of the figures may differ slightly from the paper versions, but they display the same data points.
All code cells are folded by default. To view any cell, click **"Code"** to expand it, or use the code options near the main title above to unfold all at once.
Some code may be repeated, as the original Python notebook was designed for figures to be generated semi-independently.
## Setup Code - Imports and co.
Setup imports.
```{python}
#| label: setup-imports
from __future__ import annotations
import itertools
import re
import tarfile
from pathlib import Path
from typing import Dict, List, Sequence
import numpy as np
import pandas as pd
import plotly.express as px
import plotly.graph_objects as go
from IPython.display import display
from plotly.subplots import make_subplots
from scipy.stats import zscore
from epiclass.utils.notebooks.paper.paper_utilities import (
ASSAY,
ASSAY_MERGE_DICT,
ASSAY_ORDER,
BIOMATERIAL_TYPE,
CELL_TYPE,
LIFE_STAGE,
SEX,
IHECColorMap,
MetadataHandler,
SplitResultsHandler,
create_mislabel_corrector,
PathChecker
)
CORE7_ASSAYS = ASSAY_ORDER[0:7]
```
Setup paths.
```{python}
#| label: setup-paths
# Root path
base_dir = Path.home() / "Projects/epiclass/output/paper"
PathChecker.check_directory(base_dir)
# More precise
base_data_dir = base_dir / "data"
base_fig_dir = base_dir / "figures"
table_dir = base_dir / "tables"
metadata_dir = base_data_dir / "metadata"
official_metadata_dir = metadata_dir / "epiatlas" / "official"
PathChecker.check_directory(official_metadata_dir)
# alias
paper_dir = base_dir
```
Setup colors.
```{python}
#| label: setup-colors
IHECColorMap = IHECColorMap(base_fig_dir)
assay_colors = IHECColorMap.assay_color_map
cell_type_colors = IHECColorMap.cell_type_color_map
sex_colors = IHECColorMap.sex_color_map
```
Setup metadata and prediction files handlers.
```{python}
#| label: setup-handlers
split_results_handler = SplitResultsHandler()
metadata_handler = MetadataHandler(paper_dir)
metadata_v2 = metadata_handler.load_metadata("v2")
metadata_v2_df = metadata_v2.to_df()
```
Setup figures general settings.
```{python}
#| label: setup-figs-settings
main_title_settings = {
"title":dict(
automargin=True,
x=0.5,
xanchor="center",
yanchor="top",
y=0.98
),
"margin":dict(t=50, l=10, r=10)
}
```
## Figure 2
### A,B - MLP performance on various classification tasks (100kb resolution) {#sec-2AB}
- 2A: Accuracy
- 2B: F1-Score
- Supp 5A: AUC scores
Define the function that graphs performance across classification tasks.
```{python}
#| label: func-nn-performance
def NN_performance_across_classification_tasks(
split_metrics: Dict[str, Dict[str, Dict[str, float]]],
name: str | None = None,
logdir: Path | None = None,
exclude_categories: List[str] | None = None,
y_range: List[float] | None = None,
sort_by_acc: bool = False,
metric_names: Sequence[str] = ("Accuracy", "F1_macro"),
title: str | None = None,
) -> List[str]:
"""Render box plots of metrics per classifier and split, each in its own subplot.
This function generates a figure with subplots, each representing a different
metric. Each subplot contains box plots for each classifier, ordered by accuracy.
Args:
split_metrics: A nested dictionary with structure {split: {classifier: {metric: score}}}.
logdir: The directory path to save the output plots. If None, only display the plot.
name: The base name for the output plot files.
exclude_categories: Task categories to exclude from the plot.
y_range: The y-axis range for the plots.
sort_by_acc: Whether to sort the classifiers by accuracy.
metric_names: The metrics to include in the plot.
Returns:
The list of classifier names in the order they appear in the plot.
"""
# Exclude some categories
classifier_names = list(split_metrics["split0"].keys())
if exclude_categories is not None:
for category in exclude_categories:
classifier_names = [c for c in classifier_names if category not in c]
available_metrics = list(split_metrics["split0"][classifier_names[0]].keys())
try:
available_metrics.remove("count")
except ValueError:
pass
if any(metric not in available_metrics for metric in metric_names):
raise ValueError(f"Invalid metric. Metrics need to be in {available_metrics}")
# Get classifier counts
classifiers_N = split_results_handler.extract_count_from_metrics(split_metrics)
# Sort classifiers by accuracy
if sort_by_acc:
mean_acc = {}
for classifier in classifier_names:
mean_acc[classifier] = np.mean(
[split_metrics[split][classifier]["Accuracy"] for split in split_metrics]
)
classifier_names = sorted(
classifier_names, key=lambda x: mean_acc[x], reverse=True
)
# Create subplots, one column for each metric
fig = make_subplots(
rows=1,
cols=len(metric_names),
subplot_titles=metric_names,
horizontal_spacing=0.03,
)
color_group = px.colors.qualitative.Plotly
colors = {
classifier: color_group[i % len(color_group)]
for i, classifier in enumerate(classifier_names)
}
point_pos = 0
for i, metric in enumerate(metric_names):
for classifier_name in classifier_names:
values = [
split_metrics[split][classifier_name][metric] for split in split_metrics
]
fig.add_trace(
go.Box(
y=values,
name=f"{classifier_name} (N={classifiers_N[classifier_name]})",
fillcolor=colors[classifier_name],
line=dict(color="black", width=1.5),
marker=dict(size=3, color="white", line_width=1),
boxmean=True,
boxpoints="all",
pointpos=point_pos,
showlegend=i == 0, # Only show legend in the first subplot
hovertemplate="%{text}",
text=[
f"{split}: {value:.4f}"
for split, value in zip(split_metrics, values)
],
legendgroup=classifier_name,
width=0.5,
),
row=1,
col=i + 1,
)
# Title
title_text = (
"Neural network classification - Metric distribution for 10-fold cross-validation"
)
if title:
title_text = title
fig.update_layout(title_text=title_text, **main_title_settings)
# Layout
fig.update_layout(
yaxis_title="Value",
boxmode="group",
height=1200 * 0.8,
width=1750 * 0.8,
)
# Acc, F1
fig.update_layout(yaxis=dict(range=[0.88, 1.001]))
fig.update_layout(yaxis2=dict(range=[0.80, 1.001]))
# AUC
range_auc = [0.986, 1.0001]
fig.update_layout(yaxis3=dict(range=range_auc))
fig.update_layout(yaxis4=dict(range=range_auc))
if y_range is not None:
fig.update_yaxes(range=y_range)
# Save figure
if logdir:
if name is None:
name = "MLP_metrics_various_tasks"
fig.write_image(logdir / f"{name}.svg")
fig.write_image(logdir / f"{name}.png")
fig.write_html(logdir / f"{name}.html")
fig.show()
return classifier_names
```
Compute metrics.
```{python}
#| label: load-split-metrics
exclude_categories = ["track_type", "group", "disease", "PE", "martin"]
# exclude_categories = ["track_type", "group", "disease"]
exclude_names = ["chip-seq", "7c", "16ct", "no-mixed"]
hdf5_type = "hg38_100kb_all_none"
results_dir = base_data_dir / "training_results" / "dfreeze_v2" / hdf5_type
PathChecker.check_directory(results_dir)
mislabel_correction = True
if mislabel_correction:
mislabel_corrector = create_mislabel_corrector(paper_dir)
else:
mislabel_corrector = None
split_results_metrics, all_split_results = split_results_handler.general_split_metrics(
results_dir,
merge_assays=True,
exclude_categories=exclude_categories,
exclude_names=exclude_names,
return_type="both",
oversampled_only=True,
mislabel_corrections=mislabel_corrector,
verbose=False,
)
```
Graph metrics.
```{python}
#| label: plot-nn-performance
#| column: screen-inset-left
metrics_full = ["Accuracy", "F1_macro", "AUC_micro", "AUC_macro"]
fig_name = f"{hdf5_type}_perf_across_categories_full"
sorted_task_names = NN_performance_across_classification_tasks(
split_results_metrics, # type: ignore
sort_by_acc=True,
metric_names=metrics_full,
)
```
Fig. 2A,B: Distribution of accuracy and F1-score evaluated per training fold (dots) for each metadata classifier. Performance metrics are reported without applying a prediction score threshold. Dashed lines represent means, solid lines the medians, boxes the quartiles, and whiskers the farthest points within 1.5× the interquartile range.
### C - Class imbalance (Shannon Entropy)
Define function `compute_class_imbalance`.
```{python}
#| label: func-compute-imbalance
def compute_class_imbalance(
all_split_results: Dict[str, Dict[str, pd.DataFrame]],
) -> pd.DataFrame:
"""Compute class imbalance for each task and split.
Args:
all_split_results: A dictionary with structure {task_name: {split_name: split_results_df}}.
Returns:
pd.DataFrame: A DataFrame with the following columns:
- avg(balance_ratio): The average balance ratio for each task.
- n: The number of classes for each task (used for the average).
"""
# combine md5 lists
task_md5s = {
classifier_task: [split_df.index for split_df in split_results.values()]
for classifier_task, split_results in all_split_results.items()
}
task_md5s = {
classifier_task: [list(split_md5s) for split_md5s in md5s]
for classifier_task, md5s in task_md5s.items()
}
task_md5s = {
classifier_task: list(itertools.chain(*md5s))
for classifier_task, md5s in task_md5s.items()
}
# get metadata
metadata_df = metadata_handler.load_metadata_df("v2-encode")
label_counts = {}
for classifier_task, md5s in task_md5s.items():
try:
label_counts[classifier_task] = metadata_df.loc[md5s][
classifier_task
].value_counts()
except KeyError as e:
category_name = classifier_task.rsplit("_", maxsplit=1)[0]
try:
label_counts[classifier_task] = metadata_df.loc[md5s][
category_name
].value_counts()
except KeyError as e:
raise e
# Compute Shannon Entropy
class_balance = {}
for classifier_task, counts in label_counts.items():
total_count = counts.sum()
k = len(counts)
p_x = counts / total_count # class proportions
p_x = p_x.values
shannon_entropy = -np.sum(p_x * np.log2(p_x))
balance = shannon_entropy / np.log2(k)
class_balance[classifier_task] = (balance, k)
df_class_balance = pd.DataFrame.from_dict(
class_balance, orient="index", columns=["Normalized Shannon Entropy", "k"]
).sort_index()
return df_class_balance
```
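As a quick sanity check of the normalization used above, the normalized Shannon entropy can be computed by hand on toy class counts (assumed values, not EpiATLAS data):

```{python}
#| label: example-shannon-entropy
import numpy as np
import pandas as pd

# Hypothetical class counts for a 3-class task
counts = pd.Series({"a": 50, "b": 30, "c": 20})
p_x = (counts / counts.sum()).values              # class proportions
shannon_entropy = -np.sum(p_x * np.log2(p_x))     # entropy in bits
balance = shannon_entropy / np.log2(len(counts))  # 1.0 = perfectly balanced
print(round(balance, 3))  # → 0.937
```

A perfectly balanced task gives 1.0; a task dominated by one class tends toward 0.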
Compute class imbalance (Shannon entropy).
```{python}
#| label: compute-imbalance
subset = {
k: v
for k, v in all_split_results.items() # type: ignore
if not any(label in k for label in ["martin", "PE"])
}
df_class_balance = compute_class_imbalance(subset) # type: ignore
```
Define graphing function `plot_shannon_entropy`.
```{python}
#| label: fct-shannon-plot
#| column: body-outset-left
def plot_shannon_entropy(class_balance_df: pd.DataFrame, ordered_task_names: List[str] | None) -> None:
    """Graph Shannon entropy values, in the order given by ordered_task_names."""
    df = class_balance_df.copy()
    # Reorder df, keeping only tasks present in the given order
    task_names = list(df.index)
    if ordered_task_names:
        task_names = [
            task_name for task_name in ordered_task_names if task_name in df.index
        ]
    df = df.loc[task_names]
# plot
fig = px.scatter(
df,
x=df.index,
y="Normalized Shannon Entropy",
labels={
"k": "Number of classes",
"Normalized Shannon Entropy": "Normalized Shannon Entropy",
},
title="Class imbalance across tasks (higher is more balanced)",
)
fig.update_layout(
yaxis=dict(range=[0, 1]),
xaxis_title=None,
)
fig.show()
```
Graph.
```{python}
#| label: plot-shannon-entropy
#| column: body-outset-left
plot_shannon_entropy(
class_balance_df=df_class_balance,
ordered_task_names=sorted_task_names,
)
```
Fig. 2C: Shannon entropy scores for each metadata category.
### D - Donor Sex classification
`IHEC_sample_metadata_harmonization.v1.1.extended.csv` contains 314 EpiRRs with unknown sex. We applied a fully trained sex classifier to those samples.
To properly create the sex classification graphs (D, E), we need:
- Classifier predictions on samples with unknown sex label
- Metadata pre/post correction
- Average ChrY signal for each file
Load v1.1 and v1.2 official metadata.
```{python}
#| label: load-v1-metadata
metadata_v1_1_path = (
official_metadata_dir / "IHEC_sample_metadata_harmonization.v1.1.extended.csv"
)
metadata_v1_1 = pd.read_csv(metadata_v1_1_path, index_col=0)
metadata_v1_2_path = (
official_metadata_dir / "IHEC_sample_metadata_harmonization.v1.2.extended.csv"
)
metadata_v1_2 = pd.read_csv(metadata_v1_2_path, index_col=0)
```
Sanity check: make sure that the number of unknown labels is the same in our metadata vs. the official v1.1 metadata.
```{python}
#| label: check-unknown-sex-counts
full_metadata_df = metadata_v2_df
full_metadata_df["md5sum"] = full_metadata_df.index
assert (
metadata_v2_df[metadata_v2_df[SEX].isin(["unknown"])]["EpiRR"].nunique()
== metadata_v1_1[metadata_v1_1[SEX] == "unknown"].index.nunique()
== 314
)
```
Load predictions for unknown sex samples.
```{python}
#| label: load-unknown-sex-predictions
sex_results_dir = (
base_data_dir
/ "training_results"
/ "dfreeze_v2"
/ "hg38_100kb_all_none"
/ f"{SEX}_1l_3000n"
)
sex_full_model_dir = sex_results_dir / "complete_no_valid_oversample"
PathChecker.check_directory(sex_full_model_dir)
pred_unknown_file_path = (
sex_full_model_dir
/ "predictions"
/ "complete_no_valid_oversample_test_prediction_100kb_all_none_dfreeze_v2.1_sex_mixed_unknown.csv"
)
pred_unknown_df = pd.read_csv(pred_unknown_file_path, index_col=0, header=0)
```
Join metadata to predictions.
```{python}
#| label: process-unknown-sex-predictions
pred_unknown_df = pred_unknown_df[pred_unknown_df["True class"] == "unknown"]
pred_unknown_df = split_results_handler.add_max_pred(pred_unknown_df) # type: ignore
pred_unknown_df = metadata_handler.join_metadata(pred_unknown_df, metadata_v2)
pred_unknown_df["md5sum"] = pred_unknown_df.index
```
Load 10-fold cross validation results.
```{python}
#| label: load-sex-10fold-results
sex_10fold_dir = sex_results_dir / "10fold-oversampling"
PathChecker.check_directory(sex_10fold_dir)
split_results: Dict[str, pd.DataFrame] = split_results_handler.read_split_results(
sex_10fold_dir
)
concat_results_10fold: pd.DataFrame = split_results_handler.concatenate_split_results(split_results, depth=1) # type: ignore
concat_results_10fold = split_results_handler.add_max_pred(concat_results_10fold)
concat_results_10fold = metadata_handler.join_metadata(concat_results_10fold, metadata_v2)
```
#### Inner portion
Fig. 2D: Proportion of donor sex metadata originally annotated (inner circle, metadata v1.1) and predicted with high-confidence (outer circle, metadata v2.0) for female (red) and male (blue) (mixed sex not shown).
Compute values for the inner portion of the pie chart.
```{python}
#| label: calc-sex-proportion-v1-1
# Proportion of unknown, excluding mixed. same as v1.1 ihec metadata
no_mixed = full_metadata_df[full_metadata_df[SEX] != "mixed"]
with pd.option_context("display.float_format", "{:.2%}".format):
print("file-wise:")
print(no_mixed[SEX].value_counts(dropna=False) / no_mixed.shape[0])
print("\nEpiRR-wise:")
epirr_no_mixed = no_mixed.drop_duplicates(subset=["EpiRR"])
print(epirr_no_mixed[SEX].value_counts(dropna=False) / epirr_no_mixed.shape[0])
```
#### Outer portion
The outer ring represents SEX metadata labels v1.2 (without `mixed` labels), which included these modifications:

- Some unknown SEX files were labeled, using the (assay, track type) z-score in conjunction with fully trained model predictions.
- Some mislabels were corrected, using 10-fold cross-validation results.
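The labeling rule can be sketched as requiring agreement between the classifier and the chrY z-score (hypothetical helper and thresholds, illustrative only; the actual decisions were made manually from the tables below):

```{python}
#| label: example-sex-label-rule
def assign_sex(pred_label: str, max_pred: float, chrY_z: float,
               min_pred: float = 0.9, z_margin: float = 0.5) -> str:
    """Label an unknown-sex file only when the classifier prediction and the
    chrY z-score agree (hypothetical thresholds, illustrative only)."""
    confident = max_pred >= min_pred
    if confident and pred_label == "male" and chrY_z > z_margin:
        return "male"
    if confident and pred_label == "female" and chrY_z < -z_margin:
        return "female"
    return "unknown"

print(assign_sex("male", 0.97, 1.2))    # agreement -> male
print(assign_sex("female", 0.97, 1.2))  # conflict -> stays unknown
```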
```{python}
#| label: calc-sex-proportion-v1-2
meta_v1_2_no_mixed = metadata_v1_2[metadata_v1_2[SEX] != "mixed"]
with pd.option_context("display.float_format", "{:.2%}".format):
print("EpiRR-wise:")
print(meta_v1_2_no_mixed.value_counts(SEX) / meta_v1_2_no_mixed.shape[0])
```
#### Average chrY values z-score distributions
Work done in another script:
1. For each bigwig file, the chrY average value is computed with the pyBigWig module, in `chrY_bigwig_mean.py` ([permalink](https://github.com/rabyj/EpiClass/blob/e87306bccecd3e35d9ffdbb5ddb29fcaa0800a35/src/python/epiclass/utils/chrY_bigwig_mean.py)).
2. For each assay, the z-score distribution (of the mean chrY value) of the file group is computed.
- Outputs `chrXY_all.csv`
- Outputs `chrY_zscores.csv`
Fig. 2E is produced by averaging, for each EpiRR, its files' z-score values across the assay distributions.
This data is needed for the full sex mislabel context table.
\
\
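The two steps (per-group z-scores, then per-EpiRR averaging for Fig. 2E) can be sketched with toy data (hypothetical values; the `(s - s.mean()) / s.std(ddof=0)` expression matches `scipy.stats.zscore`, which uses the population standard deviation):

```{python}
#| label: example-chry-zscore-sketch
import pandas as pd

# Toy chrY means: two assumed assay groups, two EpiRRs
df = pd.DataFrame({
    "EpiRR": ["E1", "E1", "E2", "E2", "E2", "E2"],
    "assay": ["h3k4me3", "wgbs", "h3k4me3", "h3k4me3", "wgbs", "wgbs"],
    "chrY": [1.0, 10.0, 2.0, 3.0, 20.0, 60.0],
})
# 1) z-score each file's chrY mean against its assay group distribution
df["zscore"] = df.groupby("assay")["chrY"].transform(
    lambda s: (s - s.mean()) / s.std(ddof=0)
)
# 2) average the z-scores per EpiRR (the Fig. 2E metric)
epirr_means = df.groupby("EpiRR")["zscore"].mean()
print(epirr_means)
```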
Define function `compute_chrY_zscores`.
```{python}
#| label: func-compute-chry-zscores
def compute_chrY_zscores(
chrY_dir: Path, version: str, save: bool = False
) -> pd.DataFrame:
"""Compute z-scores for chrY signal data.
Computes two distributions of z-scores:
1) Per assay group, excluding raw, pval, and Unique_raw tracks.
2) Per assay+track group.
In both cases, rna-seq/mrna-seq and wgbs-standard/wgbs-pbat are put as one assay.
Args:
chrY_dir: The directory containing the chrY signal data.
version: The metadata version to use.
save: Whether to save the results.
Returns:
pd.DataFrame: The chrY signal data with z-scores appended.
"""
output_dir: Path = chrY_dir
if save:
output_dir = chrY_dir / f"dfreeze_{version}_stats"
output_dir.mkdir(parents=False, exist_ok=True)
# Get chrY signal data
chrY_df = pd.read_csv(chrY_dir / "chrXY_all.csv", header=0)
# Filter out md5s not in metadata version
metadata = MetadataHandler(paper_dir).load_metadata(version)
md5s = set(metadata.md5s)
chrY_df = chrY_df[chrY_df["filename"].isin(md5s)]
# Make sure all values are non-zero
if not (chrY_df["chrY"] != 0).all():
raise ValueError("Some chrY values are zero.")
# Merge metadata
metadata_df = pd.DataFrame.from_records(list(metadata.datasets))
metadata_df.replace({ASSAY: ASSAY_MERGE_DICT}, inplace=True)
chrY_df = chrY_df.merge(
metadata_df, left_on="filename", right_on="md5sum"
)
# Compute stats for distributions
metric_name_1 = "chrY_zscore_vs_assay_w_track_exclusion"
metric_name_2 = "chrY_zscore_vs_assay_track"
files1 = chrY_df[
~chrY_df["track_type"].isin(["raw", "pval", "Unique_raw"])
]
files2 = chrY_df
dist1 = files1.groupby(ASSAY).agg({"chrY": ["mean", "std", "count"]})
dist2 = files2.groupby([ASSAY, "track_type"]).agg({"chrY": ["mean", "std", "count"]})
if save:
dist1.to_csv(output_dir / "chrY_stats_assay_w_track_exclusion.csv")
dist2.to_csv(output_dir / "chrY_stats_assay_and_track.csv")
# Compute full z-score distributions
for groups in files1.groupby(ASSAY):
_, group_df = groups
group_df["zscore"] = zscore(group_df["chrY"])
chrY_df.loc[group_df.index, metric_name_1] = group_df["zscore"]
chrY_df.loc[group_df.index, f"N_{metric_name_1}"] = groups[1].shape[0]
for groups in files2.groupby([ASSAY, "track_type"]):
_, group_df = groups
group_df["zscore"] = zscore(group_df["chrY"])
chrY_df.loc[group_df.index, metric_name_2] = group_df["zscore"]
chrY_df.loc[group_df.index, f"N_{metric_name_2}"] = groups[1].shape[0]
# Fill in missing values
for N_name in [f"N_{metric_name_1}", f"N_{metric_name_2}"]:
chrY_df[N_name] = chrY_df[N_name].fillna(0).astype(int)
chrY_df.fillna(pd.NA, inplace=True)
if save:
output_cols = [
"filename",
ASSAY,
"track_type",
"chrY",
metric_name_1,
f"N_{metric_name_1}",
metric_name_2,
f"N_{metric_name_2}",
]
chrY_df[output_cols].to_csv(
output_dir / "chrY_zscores.csv", index=False, na_rep="NA" # type: ignore
)
return chrY_df
```
Compute chrY zscores.
```{python}
#| label: compute-chry-zscores
chrY_dir = base_data_dir / "chrY"
PathChecker.check_directory(chrY_dir)
chrY_df = compute_chrY_zscores(chrY_dir, "v2", save=False)
```
#### Unknown predictions analysis file
The following folded code generates a table similar to the one used to determine which unknown sex samples to label.
It was not used to produce any graph.
```{python}
#| label: analyze-unknown-predictions
def create_sex_pred_pivot_table(pred_unknown: pd.DataFrame, chrY_df: pd.DataFrame) -> pd.DataFrame:
"""Generate a pivot table containing group metrics per predicted label, for each EpiRR."""
index_cols = [
"EpiRR",
"project",
"harmonized_donor_type",
CELL_TYPE,
SEX,
"Predicted class",
]
pred_plus_chrY_df = pd.merge(
pred_unknown,
chrY_df,
on="md5sum",
suffixes=("", "_DROP")
)
pred_plus_chrY_df.drop(
columns=[c for c in pred_plus_chrY_df.columns if c.endswith("_DROP")],
inplace=True,
)
val_cols = ["Max pred", "chrY_zscore_vs_assay_track"]
pivot_table = pred_plus_chrY_df.pivot_table(
index=index_cols,
values=val_cols,
aggfunc=["mean", "median", "std", "count"],
)
return pivot_table
create_sex_pred_pivot_table(
pred_unknown=pred_unknown_df,
chrY_df=chrY_df,
)
```
\
#### 10fold cross-validation predictions analysis file
This section generates a table similar to the one used to determine which EpiRR sex labels might be mistaken.
It aggregates results from the 10 cross-validation classifiers.
```{python}
#| label: analyze-10fold-predictions
cross_val_analysis = concat_results_10fold.merge(chrY_df, left_index=True, right_on="md5sum", suffixes=("", "_DROP")) # type: ignore
cross_val_analysis.drop(
columns=[c for c in cross_val_analysis.columns if c.endswith("_DROP")], inplace=True
)
```
```{python}
#| label: pivot-10fold-predictions
index_cols = [
"EpiRR",
"project",
"harmonized_donor_type",
CELL_TYPE,
SEX,
"Predicted class",
]
val_cols = ["Max pred", "chrY_zscore_vs_assay_track"]
# not directly used in full mislabel analysis
to_drop = [
"N_chrY_zscore_vs_assay_w_track_exclusion",
"chrY_zscore_vs_assay_w_track_exclusion",
]
cross_val_analysis_track = cross_val_analysis.drop(to_drop, axis=1)
pivot_table = cross_val_analysis_track.pivot_table(
index=index_cols,
values=val_cols,
aggfunc=["mean", "median", "std", "count"],
)
display(pivot_table.head())
```
### E - Average EpiRR z-score for 10-fold cross-validation
Define function `zscore_merged_assays` that computes the chrY average signal metric.
```{python}
#| label: func-zscore-merged-assays
def zscore_merged_assays(
zscore_df: pd.DataFrame,
sex_mislabels: Dict[str, str],
name: str | None = None,
logdir: Path | None = None,
min_pred: float | None = None,
no_rna: bool = False,
) -> None:
"""Male vs Female z-score distribution for merged assays, excluding wgbs.
Does not include pval and raw tracks.
Highlights mislabels in the plot.
Args:
zscore_df (pd.DataFrame): The dataframe with z-score data.
sex_mislabels (Dict[str, str]): {EpiRR_no-v: corrected_sex_label}
logdir (Path): The directory path to save the output plots. If None, only display the plot.
name (str): The base name for the output plot files.
min_pred (float|None): Minimum prediction value to include in the plot. Used on average EpiRR 'Max pred' values.
no_rna (bool): Whether to exclude rna_seq from the plot.
"""
zscore_df = zscore_df.copy(deep=True)
# Remove pval/raw tracks + rna unstranded
zscore_df = zscore_df[~zscore_df["track_type"].isin(["pval", "raw", "Unique_raw"])]
# Merge rna protocols
zscore_df.replace({ASSAY: ASSAY_MERGE_DICT}, inplace=True)
# wgbs reverses male/female chrY tendency, so removed here
zscore_df = zscore_df[~zscore_df[ASSAY].str.contains("wgb")]
if no_rna:
zscore_df = zscore_df[~zscore_df[ASSAY].str.contains("rna")]
N_assays = len(zscore_df[ASSAY].unique())
print(
f"Average chrY z-score values computed from:\n{zscore_df[ASSAY].value_counts(dropna=False)}"
)
# Average chrY z-score values
metric_label = "chrY_zscore_vs_assay_w_track_exclusion"
zscore_df = zscore_df[zscore_df[metric_label] != "NA"]
mean_chrY_values_df = zscore_df.groupby(["EpiRR", SEX]).agg(
{metric_label: "mean", "Max pred": "mean"}
)
mean_chrY_values_df.reset_index(inplace=True)
if not mean_chrY_values_df["EpiRR"].is_unique:
raise ValueError("EpiRR is not unique.")
# Filter out low prediction values
if min_pred is not None:
mean_chrY_values_df = mean_chrY_values_df[
mean_chrY_values_df["Max pred"] > min_pred
]
mean_chrY_values_df.reset_index(drop=True, inplace=True)
    mean_chrY_values_df["EpiRR_no_v"] = mean_chrY_values_df["EpiRR"].str.extract(
        pat=r"(\w+\d+)\.\d+"
    )[0]
chrY_values = mean_chrY_values_df[metric_label]
female_idx = np.argwhere((mean_chrY_values_df[SEX] == "female").values).flatten() # type: ignore
male_idx = np.argwhere((mean_chrY_values_df[SEX] == "male").values).flatten() # type: ignore
# Mislabels
predicted_as_female = set(
epirr_no_v for epirr_no_v, label in sex_mislabels.items() if label == "female"
)
predicted_as_male = set(
epirr_no_v for epirr_no_v, label in sex_mislabels.items() if label == "male"
)
predicted_as_female_idx = np.argwhere(mean_chrY_values_df["EpiRR_no_v"].isin(predicted_as_female).values).flatten() # type: ignore
predicted_as_male_idx = np.argwhere(mean_chrY_values_df["EpiRR_no_v"].isin(predicted_as_male).values).flatten() # type: ignore
print(
f"Adding mislabels to graph: {len(predicted_as_female_idx)} male->female, {len(predicted_as_male_idx)} female->male"
)
# Hovertext
hovertext = [
f"{epirr}: <z-score>={z_score:.3f}"
for epirr, z_score in zip(
mean_chrY_values_df["EpiRR"],
mean_chrY_values_df[metric_label],
)
]
hovertext = np.array(hovertext)
# sanity check
if mean_chrY_values_df["EpiRR"].nunique() != mean_chrY_values_df.shape[0]:
raise ValueError("EpiRR is not unique.")
# Create figure
fig = go.Figure()
fig.add_trace(
go.Box(
name="Female",
x=[0] * len(female_idx),
y=chrY_values[female_idx], # type: ignore
boxmean=True,
boxpoints="all",
pointpos=-2,
hovertemplate="%{text}",
text=hovertext[female_idx],
marker=dict(size=2),
line=dict(width=1, color="black"),
fillcolor=sex_colors["female"],
showlegend=True,
),
)
fig.add_trace(
go.Box(
name="Male",
            x=[1] * len(male_idx),
y=chrY_values[male_idx], # type: ignore
boxmean=True,
boxpoints="all",
pointpos=-2,
hovertemplate="%{text}",
text=hovertext[male_idx],
marker=dict(size=2),
line=dict(width=1, color="black"),
fillcolor=sex_colors["male"],
showlegend=True,
),
)
fig.add_trace(
go.Scatter(
name="Male",
x=[-0.5] * len(predicted_as_male_idx),
y=chrY_values[predicted_as_male_idx], # type: ignore
mode="markers",
marker=dict(
size=10, color=sex_colors["male"], line=dict(width=1, color="black")
),
hovertemplate="%{text}",
text=hovertext[predicted_as_male_idx],
showlegend=False,
),
)
fig.add_trace(
go.Scatter(
name="Female",
x=[0.5] * len(predicted_as_female_idx),
y=chrY_values[predicted_as_female_idx], # type: ignore
mode="markers",
marker=dict(
size=10, color=sex_colors["female"], line=dict(width=1, color="black")
),
hovertemplate="%{text}",
text=hovertext[predicted_as_female_idx],
showlegend=False,
),
)
fig.update_xaxes(showticklabels=False)
fig.update_yaxes(range=[-1.5, 3])
title = f"z-score(mean chrY signal per file) distribution - z-scores averaged over {N_assays} assays"
if min_pred is not None:
title += f"<br>avg_maxPred>{min_pred}"
fig.update_layout(
title=dict(text=title, x=0.5),
xaxis_title=SEX,
yaxis_title="Average z-score",
width=750,
height=750,
)
# Save figure
if logdir:
this_name = f"{metric_label}_n{mean_chrY_values_df.shape[0]}"
if name:
this_name = f"{name}_{this_name}"
fig.write_image(logdir / f"{this_name}.svg")
fig.write_image(logdir / f"{this_name}.png")
fig.write_html(logdir / f"{this_name}.html")
fig.show()
```
Graph zscores.
```{python}
#| label: plot-zscore-merged-assays
#| column: body-outset-left
zscore_merged_assays(
zscore_df=cross_val_analysis,
sex_mislabels=create_mislabel_corrector(paper_dir)[1][SEX],
name="no_RNA",
no_rna=True,
)
```
Fig 2E: Distribution of the average z-score signal of epigenomes (dots) over chrY, computed on the ChIP-Seq datasets (up to 7 assays per epigenome) using the fold change track type files for female (red) and male (blue). Originally mislabeled epigenomes are shown as big dots. Boxplot elements are as in [Fig. 2A](./fig2.html#sec-2AB).
### F - Donor life stage and GP-Age
Load official IHEC sample metadata.
```{python}
#| label: load-v1-metadata-again
meta_v1_2_df = pd.read_csv(
official_metadata_dir / "IHEC_sample_metadata_harmonization.v1.2.extended.csv"
)
meta_v1_1_df = pd.read_csv(
official_metadata_dir / "IHEC_sample_metadata_harmonization.v1.1.extended.csv"
)
```
Sanity check: both metadata versions have the same 'cell line' samples.
```{python}
#| label: check-cell-lines
cell_lines_v11 = meta_v1_1_df[meta_v1_1_df[BIOMATERIAL_TYPE] == "cell line"][
"epirr_id_without_version"
].unique()
cell_lines_v12 = meta_v1_2_df[meta_v1_2_df[BIOMATERIAL_TYPE] == "cell line"][
"epirr_id_without_version"
].unique()
assert set(cell_lines_v11) == set(cell_lines_v12)
```
Load GP-age results.
```{python}
#| label: load-gpage-data
gp_age_dir = base_data_dir / "GP_age"
PathChecker.check_directory(gp_age_dir)
df_gp_age = pd.read_csv(gp_age_dir / "life_stage_prediction.GPage.20250513.tsv", sep="\t")
df_gp_age["graph_age"] = df_gp_age["model30"]
epiclass_pred_col = "epiclass_predicted_lifestage"
display(df_gp_age.head())
```
The `epiclass_predicted_lifestage` column is manually curated from an analysis of all predictions for all samples.
When an expected label is known and the EpiClass predictions are inconclusive (low average max pred and/or no majority consensus), the expected label is kept.
The columns `tissue` and `tissue_subtype` are reformatted versions of `harmonized_sample_organ_system_order_AnetaMikulasova` and `harmonized_sample_organ_order_AnetaMikulasova`.
\
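The consensus rule described above can be sketched as a small helper (hypothetical function and thresholds; the actual curation was done manually):

```{python}
#| label: example-lifestage-consensus
import pandas as pd

def resolve_label(preds: pd.Series, max_preds: pd.Series,
                  expected: str, min_conf: float = 0.8) -> str:
    """Keep the expected label when per-sample predictions are inconclusive
    (hypothetical sketch of the manual curation rule, illustrative only)."""
    top = preds.mode()
    has_majority = len(top) == 1 and (preds == top.iloc[0]).mean() > 0.5
    is_confident = max_preds.mean() >= min_conf
    if has_majority and is_confident:
        return top.iloc[0]
    return expected

print(resolve_label(pd.Series(["adult", "adult", "child"]),
                    pd.Series([0.95, 0.90, 0.60]), expected="unknown"))
```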
The following code remaps the five life stage classes to three (perinatal, pediatric, adult).
```{python}
#| label: process-gpage-data
df_gp_age.loc[:, "graph_age_categories"] = df_gp_age[epiclass_pred_col].str.removesuffix(
"_pred"
)
gp_age_categories = {
"adult": "adult",
"child": "pediatric",
"embryonic": "perinatal",
"fetal": "perinatal",
"newborn": "perinatal",
"unknown": "unknown",
}
df_gp_age.loc[:, "graph_age_categories"] = df_gp_age["graph_age_categories"].map(
gp_age_categories
)
display(df_gp_age["graph_age_categories"].value_counts(dropna=False))
```
Here we merge GP-age data with additional EpiAtlas metadata.
```{python}
#| label: merge-gpage-metadata
epirr_col = "epirr_id_without_version"
merged_dp_age = pd.merge(
df_gp_age,
meta_v1_2_df[[epirr_col, CELL_TYPE, BIOMATERIAL_TYPE]],
left_on="epirr_noV",
right_on=epirr_col,
how="left",
)
merged_dp_age.drop_duplicates(subset=[epirr_col], inplace=True)
merged_dp_age.drop(columns=["epirr_noV"], inplace=True)
```
Exclude 'cell line' samples from the considered data.
```{python}
#| label: remove-cell-lines-gpage
# Removing cell lines: life stage makes less sense
# type: ignore
N_before = merged_dp_age.shape[0]
merged_dp_age: pd.DataFrame = merged_dp_age[
merged_dp_age[BIOMATERIAL_TYPE] != "cell line"
]
print(f"Removed {N_before - merged_dp_age.shape[0]} cell lines.")
```
Create the graph categories.
We need to categorize whole blood separately from other tissues, since the GP-age training data consists only of whole blood. We keep 'immune system' tissues, specifically venous and umbilical cord blood, since they match the training data most closely; we are less sure about bone marrow.
```{python}
#| label: create-gpage-tissue-groups
layer1_vals = ["IMMU"]
layer2_vals = ["blood-umbilical-cord", "blood-venous"]
merged_dp_age.loc[:, "tissue_group"] = [
"blood" if (val1 in layer1_vals and val2 in layer2_vals) else "other"
for val1, val2 in merged_dp_age.loc[:, ["tissue", "tissue_subtype"]].values
]
```
Sanity check: no NaN values present.
```{python}
#| label: check-gpage-nan
important_cols = [
"epirr_id_without_version",
"tissue_group",
epiclass_pred_col,
"graph_age",
]
missing_N = merged_dp_age.loc[:, important_cols].isna().sum().sum()
if missing_N > 0:
raise ValueError(f"Missing values in merged_dp_age: {missing_N}")
```
Define graphing function `graph_gp_age`.
```{python}
#| label: func-graph-gpage
def graph_gp_age(
df_gp_age: pd.DataFrame,
logdir: Path | None = None,
name: str | None = None,
title: str | None = None,
) -> None:
"""
Plot the GP age predictions.
Args:
df_gp_age: The dataframe with GP age data.
"""
df = df_gp_age.copy(deep=True)
tissue_colors = {"blood": "red", "other": "gray"}
age_cat_label = "graph_age_categories"
fig = go.Figure()
for tissue_group in sorted(df["tissue_group"].unique()):
sub_df = df[df["tissue_group"] == tissue_group]
fig.add_trace(
go.Box(
name=f"{tissue_group} (n={len(sub_df)})",
x=sub_df[age_cat_label],
y=sub_df["graph_age"],
boxmean=True,
boxpoints="all",
hovertemplate="%{text}",
text=[
f"{ct}: {age:.3f}"
for ct, age in zip(sub_df[CELL_TYPE], sub_df["graph_age"])
],
marker=dict(size=2, color=tissue_colors[tissue_group]),
showlegend=True,
),
)
fig.update_layout(
title=f"GP age predictions - Using MLP predicted labels ({title})",
xaxis_title="Life stage",
yaxis_title="GP-Age : Predicted age",
width=750,
height=750,
boxmode="group",
)
# Order x-axis
label_order = ["perinatal", "pediatric", "adult"]
axis_labels = [
f"{age_cat} (n={(df[age_cat_label] == age_cat).sum()})" for age_cat in label_order
]
fig.update_xaxes(categoryorder="array", categoryarray=label_order)
fig.update_xaxes(tickvals=[0, 1, 2], ticktext=axis_labels)
# Save figure
if logdir:
if name is None:
name = "GP_age_predictions"
fig.write_image(logdir / f"{name}.svg")
fig.write_image(logdir / f"{name}.png")
fig.write_html(logdir / f"{name}.html")
fig.show()
```
Define the dataframe summary function `display_tissue_age_df`.
```{python}
#| label: func-display-tissue-age-df
def display_tissue_age_df(df: pd.DataFrame) -> None:
"""Display tissue group count breakdown."""
count = df.groupby(["tissue_group", "graph_age_categories"]).size()
print("Tissue group summary")
print(f"Blood: {count.loc['blood'].sum()}") # type: ignore
print(f"Other: {count.loc['other'].sum()}") # type: ignore
print(f"Total: {count.sum()}")
print("Detail:")
print(count)
```
For initially unknown labels, only predictions that were manually confirmed as mislabels are used.
```{python}
#| label: plot-gpage-unknown-samples
#| column: body-outset-left
# Remove inconclusive predictions
gp_graph_df = merged_dp_age[merged_dp_age[epiclass_pred_col] != "unknown"]
only_unknown_df: pd.DataFrame = gp_graph_df[gp_graph_df[LIFE_STAGE] == "unknown"]
display_tissue_age_df(only_unknown_df)
graph_gp_age(
df_gp_age=only_unknown_df,
name="GP_age_predictions_unknown_only_MLP_predicted_labels",
title="Samples with unknown life stage",
)
```
Fig 2F: Distribution of the age prediction from GP-age for the WGBS datasets with an originally unknown life stage predicted by EpiClass (datasets related to blood biospecimen are in red, others in grey). Boxplot elements are as in [Fig. 2A](./fig2.html#sec-2AB).
#### Supplementary Figure 5D - GP-age {#sec-supp-5D}
Considering all samples that had a conclusive EpiClass prediction (or an existing label and no conclusive prediction).
```{python}
#| label: plot-gpage-all-samples
#| column: body-outset-left
display_tissue_age_df(gp_graph_df)
graph_gp_age(
df_gp_age=gp_graph_df,
name="GP_age_predictions_all_samples_MLP_predicted_labels",
title="All samples (has a life stage pred + gp-age)",
)
```
Supp. Fig. 5D: Distribution of the age prediction from GP-age for the 565 epigenomes (dots) with conclusive consensus predictions (epigenomes related to blood biospecimen are in red, others in grey). The perinatal category encompasses the original embryonic, fetal and newborn categories of metadata as they individually contain too few samples. Dashed lines represent means, solid lines the medians, boxes the quartiles, and whiskers the farthest points within 1.5× the interquartile range.
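The whisker and box definitions in the caption can be sketched directly (a toy illustration with hypothetical values; NumPy's linear-interpolation percentiles may differ slightly from Plotly's default quartile algorithm):

```python
import numpy as np

def boxplot_elements(values) -> dict:
    """Summarize data the way the caption describes the boxplots:
    mean (dashed line), median (solid line), quartile box, and whiskers
    at the farthest points within 1.5x the interquartile range."""
    values = np.asarray(values, dtype=float)
    q1, median, q3 = np.percentile(values, [25, 50, 75])
    iqr = q3 - q1
    within = values[(values >= q1 - 1.5 * iqr) & (values <= q3 + 1.5 * iqr)]
    return {
        "mean": values.mean(),
        "median": median,
        "q1": q1,
        "q3": q3,
        "whisker_low": within.min(),
        "whisker_high": within.max(),
    }

# 100 falls outside the upper fence (q3 + 1.5*IQR), so the whisker
# stops at 7 and 100 would be drawn as an individual point.
elements = boxplot_elements([1, 2, 3, 4, 5, 6, 7, 100])
```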
### G - Biospecimen classifier - ChromScore for high-SHAP regions {#sec-2G}
For full code, since the processing is more complex, see `src/python/epiclass/utils/notebooks/analyze_hdf5_vals.ipynb` ([permalink](https://github.com/rabyj/EpiClass/blob/e87306bccecd3e35d9ffdbb5ddb29fcaa0800a35/src/python/epiclass/utils/notebooks/analyze_hdf5_vals.ipynb)), particularly section "ChromScore hdf5 values".
::: {.text-center}
<iframe
src="../resources/global_chromscore.html"
width="510"
height="510"
></iframe>
:::
Fig 2G: Distribution of the average ChromScore values over the important Biospecimen classifier regions according to SHAP (blue) compared to the global distribution (pink). Statistical significance was assessed using a two-sided Welch's t-test. Boxplot elements are as in [Fig. 2A](./fig2.html#sec-2AB), with a violin representation on top.
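The Welch's t-test named in the caption is the unequal-variance form of the independent two-sample t-test, available in SciPy via `equal_var=False`. A minimal sketch with synthetic stand-in data (the values are illustrative, not the actual ChromScore distributions):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
# Synthetic stand-ins for the two groups compared in Fig 2G:
# average ChromScore over high-SHAP regions vs the global distribution.
high_shap_scores = rng.normal(loc=0.6, scale=0.10, size=300)
global_scores = rng.normal(loc=0.5, scale=0.15, size=3000)

# equal_var=False selects Welch's t-test; the default alternative
# hypothesis is "two-sided", matching the caption.
t_stat, p_value = stats.ttest_ind(high_shap_scores, global_scores, equal_var=False)
```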
### H - Gene Ontology and SHAP values {#sec-2H}
See `src/python/epiclass/utils/notebooks/profile_bed.ipynb` ([permalink](https://github.com/rabyj/EpiClass/blob/e87306bccecd3e35d9ffdbb5ddb29fcaa0800a35/src/python/epiclass/utils/notebooks/profile_bed.ipynb)) for creation of gene ontology files.
The code compares the important SHAP regions of different cell types against the gene annotation GFF (using the gProfiler module).
Define GO terms.
```{python}
#| label: define-go-terms
selected_cell_types = [
"T_cell",
"neutrophil",
"lymphocyte_of_B_lineage",
"brain",
"hepatocyte",
]
go_terms_table = [
"T cell receptor complex",
"plasma membrane signaling receptor complex",
"adaptive immune response",
"receptor complex",
"secretory granule",
"secretory vesicle",
"secretory granule membrane",
"intracellular vesicle",
"immunoglobulin complex",
"immune response",
"immune system process",
"homophilic cell adhesion via plasma membrane adhesion molecules",
"DNA binding",
"cell-cell adhesion via plasma-membrane adhesion molecules",
"RNA polymerase II cis-regulatory region sequence-specific DNA binding",
"blood microparticle",
"platelet alpha granule lumen",
"fibrinogen complex",
"endoplasmic reticulum lumen",
]
# Add some line breaks for readability
go_terms_graph = [
"T cell receptor complex",
"plasma membrane<br>signaling receptor complex",
"adaptive immune response",
"receptor complex",
"secretory granule",
"secretory vesicle",
"secretory granule membrane",
"intracellular vesicle",
"immunoglobulin complex",
"immune response",
"immune system process",
"homophilic cell adhesion via<br>plasma membrane adhesion molecules",
"DNA binding",
"cell-cell adhesion via<br>plasma-membrane adhesion molecules",
"RNA polymerase II cis-regulatory<br>region sequence-specific DNA binding",
"blood microparticle",
"platelet alpha granule lumen",
"fibrinogen complex",
"endoplasmic reticulum lumen",
]
```
Setup paths.
```{python}
#| label: load-go-bed-files
hdf5_type = "hg38_100kb_all_none"
SHAP_dir = base_data_dir / "SHAP"
cell_type_shap_dir = (
SHAP_dir / hdf5_type / f"{CELL_TYPE}_1l_3000n" / "10fold-oversampling"
)
beds_file = cell_type_shap_dir / "select_beds_top303.tar.gz"
PathChecker.check_file(beds_file)
```
Load top g:profiler results for each cell type.
```{python}
#| label: read-go-dfs
all_go_dfs: Dict[str, pd.DataFrame] = {}
with tarfile.open(beds_file, "r:gz") as tar:
for member in tar.getmembers():
filename = member.name
if filename.endswith("profiler.tsv") and "merge_samplings" in filename:
with tar.extractfile(member) as f: # type: ignore
go_df = pd.read_csv(f, sep="\t", index_col=0)
all_go_dfs[member.name] = go_df
assert len(all_go_dfs) == 16
```
Merge results for all cell types into one table.
```{python}
#| label: process-go-dfs
for name, df in all_go_dfs.items():
sub_df = df.copy()
    match = re.match(
        r".*/merge_samplings_(.*)_features_intersect_gff_gprofiler.tsv", name
    )
    assert match is not None
    sub_df.loc[:, "shap_source"] = match.group(1)
sub_df.loc[:, "table_val"] = -np.log10(sub_df.loc[:, "p_value"])
all_go_dfs[name] = sub_df
full_concat_df = pd.concat(all_go_dfs.values())
full_concat_df = full_concat_df.drop(["significant", "query"], axis=1)
assert all(go_term in full_concat_df["name"].values for go_term in go_terms_table)
table = full_concat_df.pivot_table(
index="name", columns="shap_source", values="table_val", aggfunc="mean"
)
```
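The `pivot_table` call above reshapes the long concatenated table (one row per GO term per classifier class) into a wide term-by-class matrix. A toy version of the same reshaping, with hypothetical values:

```python
import pandas as pd

long_df = pd.DataFrame(
    {
        "name": ["immune response", "immune response", "DNA binding"],
        "shap_source": ["T_cell", "neutrophil", "brain"],
        "table_val": [6.2, 4.1, 3.3],
    }
)
# Same call shape as above: rows indexed by GO term, one column per
# cell type; aggfunc="mean" collapses any duplicate (term, source) pairs.
wide = long_df.pivot_table(
    index="name", columns="shap_source", values="table_val", aggfunc="mean"
)
# Combinations absent from the long table become NaN, which the
# heatmap function labels "NS" (not significant).
```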
Collect subset of table for the five selected cell types.
```{python}
#| label: subset-go-table
table_5_ct = table.loc[go_terms_table, selected_cell_types].copy()
assert table_5_ct.shape == (len(go_terms_graph), len(selected_cell_types))
# Rename index with the line-broken labels. Assign directly: passing the new
# index to the DataFrame constructor would reindex by label and turn every
# renamed row into NaN.
table_5_ct.index = pd.Index(go_terms_graph)
```
Define graphing function `plot_go_heatmap`.
```{python}
#| label: define-plot-fn
def plot_go_heatmap(table: pd.DataFrame, width: int, height: int, title: str | None = None):
"""Plot a GO heatmap."""
# Define colorbar
sigma = "\u03c3"
colorbar = dict(
title="-log<sub>10</sub>(p-value)",
tickvals=[0, 1.30, 2, 3, 5, 6.53, 10],
ticktext=[
"0",
f"1.30: p=0.05~2{sigma}",
"2: p=0.01",
f"3: p=0.001~3{sigma}",
"5",
f"6.53: p=3x10<sup>-7</sup> = 5{sigma}",
"10",
],
)
# keep NaNs in for labeling
z = table.values
text = np.where(np.isnan(z), "NS", np.char.mod("%.2f", z))
fig = go.Figure(
data=go.Heatmap(
z=np.nan_to_num(z, nan=0), # replace NaN with 0 for coloring
x=table.columns,
y=table.index,
colorscale="Blues",
zmin=0,
zmax=10,
colorbar=colorbar,
text=text,
texttemplate="%{text}",
hovertemplate="GO Term: %{y}<br>Class: %{x}<br>Value: %{z:.2f}<extra></extra>",
showscale=True,
xgap=2,
ygap=2,
)
)
fig.update_layout(
title_text=title,
width=width,
height=height,
plot_bgcolor="black",
)
fig.update_xaxes(showgrid=False)
fig.update_yaxes(showgrid=False)
fig.show()
```
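The colorbar tick labels above map `-log<sub>10</sub>(p-value)` onto Gaussian sigma levels; the correspondence can be checked with `scipy.stats.norm` (here assuming the ~2σ tick follows the two-sided tail and the 5σ tick the one-sided convention, which reproduces the quoted numbers):

```python
import numpy as np
from scipy.stats import norm

# p = 0.05 sits near 2 sigma: the two-sided 2-sigma tail probability
# is ~0.0455, and -log10(0.05) gives the 1.30 tick.
p_2sigma_two_sided = 2 * norm.sf(2)
tick_p05 = -np.log10(0.05)

# p = 3e-7 corresponds to 5 sigma under the one-sided convention:
# norm.sf(5) ~ 2.87e-7, hence the tick at ~6.53.
p_5sigma_one_sided = norm.sf(5)
tick_5sigma = -np.log10(p_5sigma_one_sided)
```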
Graph GO for 5 cell types.
```{python}
#| label: plot-go-heatmap-5ct
#| fig-align: center
plot_go_heatmap(
table=table_5_ct,
width=600,
height=1000,
)
```
Fig 2H: Top four gene ontology (GO) terms enriched by g:Profiler[35] for genes within important Biospecimen classifier regions according to SHAP values.
#### 5 top pval terms for each biospecimen label
Collect the 5 most significant terms for each of the 16 cell types.
```{python}
#| label: get-top5-go-terms
# preserve order
top_5_terms = []
for name, df in all_go_dfs.items():
df.sort_values("p_value", inplace=True)
top_5 = df["name"].head(5).to_list()
top_5_terms.extend(top_5)
top_5_terms = list(dict.fromkeys(top_5_terms))
```
Graph.
```{python}
#| label: plot-go-heatmap-top5
#| column: screen-inset-left
table_top_5_GO = table.loc[top_5_terms, :].copy()
plot_go_heatmap(
table=table_top_5_GO,
width=1300,
height=2000,
)
```
### I - Biospecimen important regions chromatin states {#sec-2I}
Images extracted from Epilogos viewer, using:
- Region: chr11:118300000-118400000
- View mode: Single
- Dataset: IHEC
- All biosamples
- Saliency Metric: S1
{width=column.page-left}
Fig. 2I: Epilogos visualization of one of the important Biospecimen classifier regions enriched in the `T cell receptor complex` GO term from [Fig. 2H](./fig2.html#sec-2H).
See [Annex A](./fig2.html#sec-annex-a) for a more detailed Epilogos color legend.
### J - Copy Number Alteration (CNA) Signatures and SHAP values
Note: The code uses the acronym CNV (Copy Number Variant/Variation) instead of CNA.
See CNV_treatment.ipynb ([permalink](https://github.com/rabyj/EpiClass/blob/e87306bccecd3e35d9ffdbb5ddb29fcaa0800a35/src/python/epiclass/utils/notebooks/paper/CNV_treatment.ipynb)) for the creation of the CNA statistics. The following figures show the Z-scores of the signal within the top SHAP features (N=336) versus 200 random feature sets of the same size, for signatures created using patient samples with cancer types similar to the EpiAtlas content (see paper Methods).
\
\
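The z-scores loaded below were computed upstream in `CNV_treatment.ipynb`; the underlying scheme compares an observed statistic over the top SHAP features against a null distribution built from random feature sets of the same size. A minimal sketch of that scheme with synthetic numbers (all values here are illustrative, not the actual pipeline outputs):

```python
import numpy as np

def enrichment_z_score(observed: float, null_values: np.ndarray) -> float:
    """Z-score of an observed statistic against a random-set null."""
    return (observed - null_values.mean()) / null_values.std(ddof=1)

rng = np.random.default_rng(0)
# e.g. mean CNA signature signal over the 336 top SHAP features...
observed_signal = 0.8
# ...compared to the same statistic over 200 random feature sets
# of equal size, which form the null distribution.
null_signals = rng.normal(loc=0.5, scale=0.1, size=200)

z = enrichment_z_score(observed_signal, null_signals)
```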
Load z-scores.
```{python}
#| label: load-cnv-data
cnv_dir = base_data_dir / "CNV"
cnv_intersection_results = (
cnv_dir
/ "signature_analysis"
/ "epiatlas_cancer_types"
/ "important_cancer_features_z_scores_vs_random200.tsv"
)
PathChecker.check_file(cnv_intersection_results)
cnv_df = pd.read_csv(cnv_intersection_results, sep="\t", index_col=0)
cnv_df.name = cnv_intersection_results.stem
```
Define graphing function `plot_cnv_zscores`, which uses the CNA groups defined by [Signatures of copy number alterations in human cancer](https://doi.org/10.1038/s41586-022-04738-6). See Supplementary Table 8 for details.
```{python}
#| label: func-plot-cnv-zscores
def plot_cnv_zscores(cnv_df: pd.DataFrame, logdir: Path | None = None) -> None:
"""Plot z-scores of top SHAP features vs random feature sets, grouped by CNV groups.
Args:
cnv_df: The DataFrame with z-scores.
logdir: The output directory to save the plot.
"""
n_beds = int(cnv_df.name.split("random")[1])
signature_subset_name = "EpiATLAS cancer types"
CN_groups = [
[f"CN{i}" for i in range(1, 4)],
[f"CN{i}" for i in range(9, 13)],
[f"CN{i}" for i in range(13, 17)],
[f"CN{i}" for i in range(17, 18)],
[f"CN{i}" for i in range(18, 22)],
[f"CN{i}" for i in range(4, 9)],
]
CN_names = [
"CN1-CN3",
"CN9-CN12",
"CN13-CN16",
"CN17",
"CN18-CN21",
"CN4-CN8",
]
# Assign groups to the DataFrame
cnv_df["group"] = "Other"
for i, group in enumerate(CN_groups):
cnv_df.loc[cnv_df.index.isin(group), "group"] = CN_names[i]
# Sort groups
group_medians = (
cnv_df.groupby("group")["z_score"].median().sort_values(ascending=False)
)
sorted_CN_names = group_medians.index.tolist()
# Create the figure
fig = go.Figure()
for group in sorted_CN_names:
group_data = cnv_df[cnv_df["group"] == group]
marker_size = 4 if group != "CN17" else 6
# Add the box plot without points
fig.add_trace(
go.Box(
y=group_data["z_score"],
name=group,
boxmean=True,
boxpoints=False, # Don't show points in the box plot
line=dict(color="black"),
fillcolor="rgba(255,255,255,0)",
showlegend=False,
)
)
# Add scatter plot for individual points
fig.add_trace(
go.Scatter(
x=[group] * len(group_data),
y=group_data["z_score"],
mode="markers",
marker=dict(
color="red",
size=marker_size,
),
name=group,
showlegend=False,
text=group_data.index, # Use CN names as hover text
hoverinfo="text+y", # Show CN name and y-value on hover
)
)
# Update layout
fig.update_layout(
xaxis_title="CNA Group",
yaxis_title="Z-score",
**main_title_settings
)
# Add a horizontal line at y=0 for reference
fig.add_hline(y=0, line_color="grey", line_width=1)
# Show and save the figure
if logdir:
name = "important_cancer_features_z_scores_boxplot"
fig.write_image(logdir / f"{name}.png")
fig.write_image(logdir / f"{name}.svg")
fig.write_html(logdir / f"{name}.html")
fig.show()
```
Graph.
```{python}
#| label: plot-cnv-zscores
#| column: body-outset-left
plot_cnv_zscores(cnv_df)
```
Define the graphing function `plot_cnv_zscores_alt`, which groups CNA by focal vs chromosomal events.
```{python}
#| label: func-plot-cnv-zscores-alt
def plot_cnv_zscores_alt(cnv_df: pd.DataFrame, logdir: Path | None = None) -> None:
"""Plot z-scores of top SHAP features vs random feature sets, grouped by focal vs chromosomal events.
Args:
cnv_df: The DataFrame with z-scores.
logdir: The output directory to save the plot.
"""
n_beds = int(cnv_df.name.split("random")[1])
signature_subset_name = "EpiATLAS cancer types"
# Assign groups
CN_groups = [
[f"CN{i}" for i in range(1, 9)] + [f"CN{i}" for i in range(13, 17)],
[f"CN{i}" for i in range(9, 13)] + [f"CN{i}" for i in range(17, 22)],
]
CN_names = [
"Chromosomal events (CN1-CN8, CN13-CN16)",
"Focal events (CN9-CN12, CN17-CN21)",
]
# Assign groups to the DataFrame
cnv_df["group"] = "Other"
for i, group in enumerate(CN_groups):
cnv_df.loc[cnv_df.index.isin(group), "group"] = CN_names[i]
# Sort groups
group_medians = (
cnv_df.groupby("group")["z_score"].median().sort_values(ascending=False)
)
sorted_CN_names = group_medians.index.tolist()
# Create the figure
fig = go.Figure()
for group in sorted_CN_names:
group_data = cnv_df[cnv_df["group"] == group]
hover_text = [
f"{CN_name}: Z={val:.3f}"
for CN_name, val in zip(group_data.index, group_data["z_score"])
]
# Add the box plot without points
fig.add_trace(
go.Box(
y=group_data["z_score"],
name=group,
boxmean=True,
boxpoints="all",
line=dict(color="black"),
fillcolor="rgba(255,255,255,0)",
showlegend=False,
hoverinfo="text", # Show CN name and y-value on hover
hovertext=hover_text,
)
)
# Update layout
fig.update_layout(
xaxis_title="Copy Number Alteration Event Type",
yaxis_title="CNA Enrichment (Z-score)",
**main_title_settings
)
# Add a horizontal line at y=0 for reference
fig.add_hline(y=0, line_color="grey", line_width=1)
# Show and save the figure
if logdir:
name = "important_cancer_features_z_scores_boxplot_V2"
fig.write_image(logdir / f"{name}.png")
fig.write_image(logdir / f"{name}.svg")
fig.write_html(logdir / f"{name}.html")
fig.show()
```
Graph.
```{python}
#| label: plot-cnv-zscores-alt
#| column: body-outset-left
plot_cnv_zscores_alt(cnv_df)
```
Fig. 2J: Distribution of the z-score enrichment for copy number (CN) signatures (dots) in genomic regions identified as important by the Cancer status classifier compared to random control regions. CN signatures are grouped as being mostly associated with focal changes (CN9-12,17-21) or chromosomal ones (CN1-8,13-16). Boxplot elements are as in Fig 2A.
## Supplementary Figures 5 to 7
See [Supplementary Figures page](./fig2-supp.html) (separate to limit maximum website page size and loading times).
## Annex A - Epilogos color legend {#sec-annex-a}
{width=.column-body}
- biv = bivalent
- Tx = Transcription
- Repr = Repressed
- PC = Polycomb